```mermaid
flowchart LR
  Z[School] --> A
  Z --> G
  Z --> L
  A[Classroom 1] --> B(Student A)
  A --> C(Student B)
  A --> E(Student D)
  G[Classroom 2] --> H(Student H)
  G --> I(Student I)
  G --> K(Student K)
  L[Classroom 3] --> M(Student M)
```
Possible fixes (that don’t require throwing out data)
Clustered Standard Errors: use either the bootstrap or a sandwich estimator to adjust the standard errors upward
Fixed effects: make a dummy for each group and include it in the model, (potentially) eliminating the autocorrelation while also controlling for unobserved differences across groups
Random effects
(the terminology here is inconsistent and contested, but no one has come up with better words yet)
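The clustered-standard-error fix can be sketched with the `sandwich` and `lmtest` packages. This is a minimal simulated example, not the course data; all variable names are illustrative:

```r
library(sandwich)  # cluster-robust (sandwich) variance estimators
library(lmtest)    # coeftest() for re-testing coefficients

set.seed(1)
# simulate clustered data: 30 groups of 20, each group sharing a common shock
df <- data.frame(group = rep(1:30, each = 20))
df$x <- rnorm(nrow(df))
df$y <- 0.5 * df$x + rnorm(30)[df$group] + rnorm(nrow(df))

m <- lm(y ~ x, data = df)

# conventional SEs understate uncertainty when errors are correlated within groups
coeftest(m)

# cluster-robust SEs adjust the variance estimate for within-group correlation
coeftest(m, vcov = vcovCL(m, cluster = ~group))
```

The cluster-robust standard errors will generally be larger than the conventional ones whenever the within-group correlation is positive.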
Fixed Effects Models give every group its own y-intercept. The effects of “Texas” vs. “Tennessee” are assumed to be fixed and unrelated.
Random Effects Models assume that group intercepts are random draws from a common distribution. They can vary just like any other random variable, but they vary in a predictable way.
You want to estimate the effect of time spent studying on student achievement
But you also want to account for differences between teachers: you suspect some are slightly better than others.
```mermaid
flowchart LR
  Z[School] --> A
  Z --> G
  Z --> L
  A[Classroom 1] --> B(Student A)
  A --> C(Student B)
  A --> E(Student D)
  G[Classroom 2] --> H(Student H)
  G --> I(Student I)
  G --> K(Student K)
  L[Classroom 3] --> M(Student M)
```
You calculate some averages and your results look like this:
| Instructor | Number of students | Average SAT score |
|---|---|---|
| Teacher 1 | 50 | 1029 |
| Teacher 2 | 25 | 1020 |
| Teacher 3 | 5 | 1120 |
Under a fixed effects assumption, each teacher’s effect is a fixed parameter, and knowing the mean score for teacher 1 and 2 doesn’t tell you anything about what the mean score should be for teacher 3.
Under a random effects assumption, the teacher effect is drawn from a larger distribution that has a central tendency and a variance. So you might look at the value for teacher 3 and conclude: “this seems implausible; in reality, this effect should probably be closer to the effect for the other two.”
So we could assume something like this:
\[ SAT = \text{teacher effect} + b_\text{studying effect} \times \text{study time} + \epsilon \]
\[ \epsilon \sim \mathcal{N}(0, \sigma) \]
But we could instead assume a model in which the “teacher effect” itself has a normal distribution, like the error term:
\[ SAT = (\text{teacher effect}) + b_\text{studying effect} \times \text{study time} + \epsilon \]
\[ \text{teacher effect} \sim \mathcal{N}(0, \tau) \]
In the second case, we estimate a model that slightly biases teacher effects towards the grand mean.
One way to think about a random effect is that it represents a compromise between “no pooling” and “complete pooling” of regression results.
“No pooling” would mean estimating a totally separate model for each teacher
“Complete pooling” would mean estimating a model that totally ignores teacher effects.
“Partial pooling” would mean estimating a model that uses a weighted average of these cases
A partial pooling estimate favors the “no pooling” estimate when group \(j\) is large and low-variance, and favors the “complete pooling” estimate when group \(j\) is small or high-variance.
\[ \text{Estimate of random intercept } \alpha_j \approx \frac{\frac{n_j}{\sigma^2_y}}{\frac{n_j}{\sigma^2_y} + \frac{1}{\sigma^2_a}} (\bar{y}_j - \beta\bar{x}_j) + \frac{\frac{1}{\sigma^2_a}}{\frac{n_j}{\sigma^2_y} + \frac{1}{\sigma^2_a}} \mu_a \]
Random effects models will often give more plausible estimates for cases with small sample sizes - because they “pool” extreme values with small samples.
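Plugging illustrative numbers into the shrinkage formula makes the weighting concrete. A base-R sketch, using values loosely based on the teacher table above (the variance components are made up, and the covariate adjustment \(\beta\bar{x}_j\) is ignored for simplicity):

```r
# Partial-pooling (shrinkage) estimate of a group intercept:
# a precision-weighted average of the group's own mean
# and the grand mean mu_a of the intercept distribution.
partial_pool <- function(n_j, ybar_j, sigma2_y, mu_a, sigma2_a) {
  w <- (n_j / sigma2_y) / (n_j / sigma2_y + 1 / sigma2_a)
  w * ybar_j + (1 - w) * mu_a
}

# teacher 1: 50 students — estimate stays close to the group's own mean
partial_pool(n_j = 50, ybar_j = 1029, sigma2_y = 100^2, mu_a = 1030, sigma2_a = 15^2)

# teacher 3: 5 students — extreme mean is pulled strongly toward the grand mean
partial_pool(n_j = 5, ybar_j = 1120, sigma2_y = 100^2, mu_a = 1030, sigma2_a = 15^2)
```

With only five students, teacher 3’s estimate lands near 1039 rather than the observed 1120: most of the weight goes to the grand mean.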
```r
library(fixest)
library(modelsummary)

# county-level election results
fmodel   <- feols(perc_gop ~ Income + White + AfAm + Hisp | state_name,
                  data = counties)
regmodel <- feols(perc_gop ~ Income + White + AfAm + Hisp,
                  data = counties, cluster = ~state_name)

mlist <- list("No fixed effects" = regmodel,
              "Fixed effects"    = fmodel)

modelsummary(mlist,
             estimate    = "{estimate}",
             statistic   = c("conf.int", "std.error"),
             conf_level  = .95,
             coef_rename = coefnames,  # named vector of display names, defined elsewhere
             coef_omit   = "Intercept",
             gof_map     = c("nobs", "aic", "bic", "vcov.type",
                             "FE: state_name"))
```

| | No fixed effects | Fixed effects |
|---|---|---|
| Median income ($1000) | -0.374 | -0.257 |
| | [-0.444, -0.304] | [-0.308, -0.206] |
| | (0.035) | (0.025) |
| % White | 0.574 | 0.592 |
| | [0.399, 0.748] | [0.525, 0.659] |
| | (0.087) | (0.034) |
| % African American | -0.047 | -0.249 |
| | [-0.247, 0.153] | [-0.362, -0.135] |
| | (0.100) | (0.057) |
| % Hispanic | 0.427 | 0.275 |
| | [0.181, 0.673] | [0.152, 0.398] |
| | (0.122) | (0.061) |
| Num.Obs. | 3115 | 3115 |
| AIC | 24166.5 | 22175.2 |
| BIC | 24196.7 | 22507.6 |
| Std.Errors | by: state_name | by: state_name |
| FE: state_name | | X |
To specify a random effect, we can use the `lme4` package. The `(1|state_name)` term says that I want to estimate random intercepts for each state:
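The call itself might look like this. A sketch: the model-object name and the comparison list are assumptions, but the formula follows the `(1|state_name)` syntax just described, and `regmodel` and `fmodel` are the fixest models fit above:

```r
library(lme4)

# (1 | state_name): a random intercept for each state
remodel <- lmer(perc_gop ~ Income + White + AfAm + Hisp + (1 | state_name),
                data = counties)

# compare against the earlier no-fixed-effects and fixed-effects models
mlist <- list("No fixed effects" = regmodel,
              "Fixed effects"    = fmodel,
              "Random effects"   = remodel)
modelsummary(mlist)
```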
| | No fixed effects | Fixed effects | Random effects |
|---|---|---|---|
| Median income ($1000) | -0.374 | -0.257 | -0.260 |
| | [-0.444, -0.304] | [-0.308, -0.206] | [-0.281, -0.239] |
| | (0.035) | (0.025) | (0.011) |
| % White | 0.574 | 0.592 | 0.588 |
| | [0.399, 0.748] | [0.525, 0.659] | [0.543, 0.633] |
| | (0.087) | (0.034) | (0.023) |
| % African American | -0.047 | -0.249 | -0.249 |
| | [-0.247, 0.153] | [-0.362, -0.135] | [-0.305, -0.194] |
| | (0.100) | (0.057) | (0.028) |
| % Hispanic | 0.427 | 0.275 | 0.272 |
| | [0.181, 0.673] | [0.152, 0.398] | [0.219, 0.326] |
| | (0.122) | (0.061) | (0.027) |
| SD (Observations) | | | 8.430 |
| Num.Obs. | 3115 | 3115 | 3115 |
| AIC | 24166.5 | 22175.2 | 22370.0 |
| BIC | 24196.7 | 22507.6 | 22412.3 |
| Std.Errors | by: state_name | by: state_name | |
| FE: state_name | | X | |
With fixed effects models, we can’t estimate coefficients for group-level covariates: anything that is constant within a group is already absorbed by the “fixed” effect:
```r
library(rvest)       # web scraping
library(tidyverse)   # dplyr verbs, stringr
library(lubridate)   # mdy(), year()

# scrape each state's date of admission to the Union
page <- read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_by_date_of_admission_to_the_Union')

join_date <- page |>
  html_element(css = 'table') |>
  html_table() |>
  select(2:4) |>
  mutate(datestring = str_extract(
           `Date(admitted or ratified)`, "([A-Z][a-z]+ [0-9]{1,2}, [0-9]{4})"),
         date = mdy(datestring))

counties <- counties |>
  left_join(join_date, by = join_by(state_name == State)) |>
  mutate(join_year = year(date) - 1787) |>   # years since the first ratification
  drop_na(join_year)

fixefmodel <- fixest::feols(perc_gop ~ Income + White + AfAm + Hisp + join_year | state_name,
                            data = counties)
fixefmodel
```

```
OLS estimation, Dep. Var.: perc_gop
Observations: 3,114
Fixed-effects: state_name: 50
Standard-errors: Clustered (state_name)
        Estimate Std. Error   t value   Pr(>|t|)    
Income -0.257136   0.025502 -10.08291 1.5294e-13 ***
White   0.591932   0.033514  17.66219  < 2.2e-16 ***
AfAm   -0.248722   0.056598  -4.39450 5.9451e-05 ***
Hisp    0.274869   0.061121   4.49710 4.2332e-05 ***
... 1 variable was removed because of collinearity (join_year)
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 8.35604   Adj. R2: 0.713165
                Within R2: 0.567463
```
But random effects models can still estimate effects for group-level characteristics:
```r
library(lme4)

ranefmodel <- lmer(perc_gop ~ Income + White + AfAm + Hisp + join_year + (1 | state_name),
                   data = counties)
summary(ranefmodel)
```

```
Linear mixed model fit by REML ['lmerMod']
Formula: perc_gop ~ Income + White + AfAm + Hisp + join_year + (1 | state_name)
   Data: counties

REML criterion at convergence: 22348.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.1757 -0.4745  0.0931  0.5828  3.3831 

Random effects:
 Groups     Name        Variance Std.Dev.
 state_name (Intercept) 113.01   10.63   
 Residual                71.06    8.43   
Number of obs: 3114, groups: state_name, 50

Fixed effects:
            Estimate Std. Error t value
(Intercept) 31.27089    3.36819   9.284
Income      -0.25919    0.01074 -24.124
White        0.59241    0.02327  25.463
AfAm        -0.24204    0.02853  -8.482
Hisp         0.27559    0.02739  10.061
join_year    0.06814    0.03294   2.068

Correlation of Fixed Effects:
          (Intr) Income White  AfAm   Hisp  
Income    -0.276                            
White     -0.693  0.072                     
AfAm      -0.630  0.154  0.830              
Hisp      -0.584  0.061  0.816  0.691       
join_year -0.576  0.018  0.095  0.105  0.058
```
We can also estimate random slopes using this same basic approach.
```r
# (Income | state_name): a random slope on Income and a random intercept per state
ranefmodel <- lmer(perc_gop ~ White + AfAm + Hisp + join_year + (Income | state_name),
                   data = counties)

library(marginaleffects)
nd <- datagrid(model = ranefmodel,
               state_name = c("Alabama", "Wyoming", "Maryland", "Virginia",
                              "Nevada", "North Carolina"),
               Income = seq(17, 170, by = 20))

plot_predictions(ranefmodel, newdata = nd, by = c('Income', 'state_name')) +
  theme_bw() +
  labs(color = 'State', fill = 'State', y = '% GOP vote')
```

The democracy score of Paraguay in 2000 is a good predictor of current democracy scores in Paraguay.
Lack of independence poses similar problems to what we observed with clustered samples: we will generally overstate our level of confidence in estimates. We actually have very few truly independent observations here.
Time trends can also be a source of spuriousness. If any two variables have a time trend, they’ll appear to be related in a naive regression.
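This spurious-trend problem is easy to simulate. A base-R sketch with arbitrary numbers: two series built from independent noise, related only through a shared upward trend:

```r
set.seed(3)
n <- 100
t <- 1:n

# two series that are unrelated except that both trend upward over time
a <- 0.1 * t + rnorm(n)
b <- 0.2 * t + rnorm(n)

naive     <- lm(b ~ a)      # slope looks strongly "significant"
detrended <- lm(b ~ a + t)  # controlling for time removes most of the association

summary(naive)$coefficients["a", ]
summary(detrended)$coefficients["a", ]
```

The naive regression finds a precise, highly significant relationship that is entirely an artifact of the common trend.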
Correcting the standard errors uses methods similar to the ones we’ve discussed elsewhere.
We could control for time linearly or with a polynomial or spline function.
We could treat time as a factor and include it as a fixed effect.
We could treat time as a random effect term.
```r
model1 <- lm(v2x_polyarchy ~ log(e_pop), data = vdem)
model2 <- lm(v2x_polyarchy ~ log(e_pop) + year, data = vdem)
model3 <- lm(v2x_polyarchy ~ log(e_pop) + splines::ns(year), data = vdem)

mlist <- list(model1, model2, model3)
modelsummary(mlist)
```

| | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | -0.116 | -3.979 | -0.077 |
| | (0.013) | (0.102) | (0.011) |
| log(e_pop) | 0.068 | 0.011 | 0.011 |
| | (0.002) | (0.002) | (0.002) |
| year | | 0.002 | |
| | | (0.000) | |
| splines::ns(year) | | | 0.639 |
| | | | (0.017) |
| Num.Obs. | 3879 | 3879 | 3879 |
| R2 | 0.196 | 0.415 | 0.415 |
| R2 Adj. | 0.196 | 0.415 | 0.415 |
| AIC | -1607.0 | -2838.9 | -2838.9 |
| BIC | -1588.2 | -2813.8 | -2813.8 |
| Log.Lik. | 806.514 | 1423.428 | 1423.428 |
| RMSE | 0.20 | 0.17 | 0.17 |
If we have panel data (multiple groups over time) we can include year as a fixed effect. This has the effect of controlling for anything that varies for all groups over time.
Since we have both time periods AND groups, we can also add country fixed effects:
```r
fmodel1 <- feols(v2x_polyarchy ~ log(e_pop) | year, data = vdem)
fmodel2 <- feols(v2x_polyarchy ~ log(e_pop) | country_name + year, data = vdem)

mlist <- list(fmodel1, fmodel2)
modelsummary(mlist)
```

| | (1) | (2) |
|---|---|---|
| log(e_pop) | 0.003 | 0.101 |
| | (0.001) | (0.037) |
| Num.Obs. | 3879 | 3879 |
| R2 | 0.568 | 0.739 |
| R2 Adj. | 0.540 | 0.721 |
| R2 Within | 0.001 | 0.054 |
| R2 Within Adj. | 0.000 | 0.053 |
| AIC | -3557.6 | -5470.2 |
| BIC | -2104.5 | -3898.1 |
| RMSE | 0.14 | 0.11 |
| Std.Errors | by: year | by: country_name |
| FE: year | X | X |
| FE: country_name | | X |
If the data depend only on the value at \(t-1\), taking the first difference should address the problem:
difference = current_value - prior_value
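In base R, the `diff()` function computes exactly this (the document's code below uses `dplyr::lag()` instead, which keeps the series aligned with the original rows):

```r
x <- c(10, 12, 15, 14, 18)

# first difference: each value minus the prior value
diff(x)
# 2  3 -1  4

# equivalently, by hand:
x[-1] - x[-length(x)]
```

Note that differencing shortens the series by one observation: there is no prior value to subtract from the first one.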
```r
vdem |>
  filter(country_name == 'Paraguay') |>
  mutate(polyarchy = v2x_polyarchy - dplyr::lag(v2x_polyarchy),
         group = 'differenced') |>
  bind_rows(vdem |>
              filter(country_name == "Paraguay") |>
              mutate(group = 'original',
                     polyarchy = v2x_polyarchy)) |>
  ggplot(aes(x = year, y = polyarchy, color = group)) +
  geom_line() +
  facet_wrap(~group) +
  theme_minimal() +
  ylab('Paraguay democracy score')
```

Taking the difference of both polyarchy and GDP in this model should eliminate the time trend effect. But it also alters what we’re actually regressing.
```r
differenced_model <- vdem |>
  filter(country_name == 'Paraguay') |>
  mutate(diff_polyarchy = v2x_polyarchy - dplyr::lag(v2x_polyarchy),
         diff_gdp = log(e_gdppc) - log(dplyr::lag(e_gdppc))) |>
  filter(year >= 1995) |>
  lm(diff_polyarchy ~ diff_gdp, data = _)

# `model` is the earlier undifferenced (levels) regression
huxreg("original_model" = model, "differenced_model" = differenced_model)
```

| | original_model | differenced_model |
|---|---|---|
| (Intercept) | 0.278 ** | 0.003 |
| | (0.086) | (0.003) |
| log(e_gdppc) | 0.112 ** | |
| | (0.032) | |
| diff_gdp | | 0.002 |
| | | (0.115) |
| N | 25 | 25 |
| R2 | 0.347 | 0.000 |
| logLik | 53.262 | 75.351 |
| AIC | -100.524 | -144.701 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
A change in the rate of change of GDP will lead to a change in the rate of change of the democracy score.
Pro: can account for the primary time series problem without adding any new parameters
Con: is actually a different model! Estimates changes, not levels!
Con: transforms the data in a way that makes it difficult to get predictions
Another option is to include the lagged dependent variable itself as a covariate. This is typically referred to as an autoregressive model.
One advantage of this approach is that you can simultaneously test multiple lags, which helps if you think there’s more than one source of trend.
```r
ar_model2 <- vdem |>
  filter(country_name == 'Paraguay') |>
  mutate(lag_polyarchy  = dplyr::lag(v2x_polyarchy),
         lag_gdp        = dplyr::lag(e_gdppc),
         lag_gdp2       = dplyr::lag(e_gdppc, 2),
         lag_polyarchy2 = dplyr::lag(v2x_polyarchy, 2)) |>
  filter(year >= 1995) |>
  lm(v2x_polyarchy ~ lag_polyarchy + lag_polyarchy2 +
       log(e_gdppc) + log(lag_gdp) + log(lag_gdp2), data = _)

huxreg(ar_model2)
```

| | (1) |
|---|---|
| (Intercept) | 0.129 * |
| | (0.047) |
| lag_polyarchy | 1.192 *** |
| | (0.199) |
| lag_polyarchy2 | -0.387 |
| | (0.208) |
| log(e_gdppc) | 0.112 |
| | (0.149) |
| log(lag_gdp) | 0.096 |
| | (0.260) |
| log(lag_gdp2) | -0.216 |
| | (0.157) |
| N | 25 |
| R2 | 0.936 |
| logLik | 82.297 |
| AIC | -150.593 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
Pro: allows a much more complex model specification and also tests for autocorrelation
Con: interpretation is weird, and it adds a lot of parameters and inevitable multicollinearity
Including a fixed effect for time can work if you have multiple observations for different periods or if you only suspect a cyclical trend
More complicated autocorrelation structures can also be handled by modeling the autocorrelation in the errors explicitly and then estimating the regression on that adjusted basis.
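One off-the-shelf implementation of this idea is generalized least squares with an explicitly modeled AR(1) error structure, via the `nlme` package. A sketch, assuming the same `vdem` data used above (the model-object name is an assumption):

```r
library(nlme)

# GLS with AR(1) autocorrelation in the residuals:
# the error at year t is modeled as rho * (error at t-1) + noise
gmodel <- gls(v2x_polyarchy ~ log(e_gdppc),
              data = subset(vdem, country_name == "Paraguay" & !is.na(e_gdppc)),
              correlation = corAR1(form = ~year))
summary(gmodel)
```

Unlike differencing, this keeps the regression in levels; the autocorrelation is absorbed into the error model rather than into the variables.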
ACF plots can be used to visualize autocorrelations. If a lagged correlation extends beyond the dashed blue confidence bounds, there’s statistically meaningful autocorrelation at that lag.
```r
# ACF of the differenced GDP series
vdem |>
  filter(country_name == 'Paraguay') |>
  arrange(year) |>
  filter(year > 1990) |>
  mutate(diff_gdp = log(e_gdppc) - log(dplyr::lag(e_gdppc))) |>
  filter(!is.na(diff_gdp)) |>
  with(acf(diff_gdp, lag.max = 5, plot = TRUE))
```

```r
# ACF of the undifferenced log GDP series, for comparison
vdem_p <- vdem |>
  filter(country_name == 'Paraguay') |>
  arrange(year) |>
  filter(year > 1990) |>
  mutate(log_gdp = log(e_gdppc)) |>
  filter(!is.na(e_gdppc)) |>
  with(acf(log_gdp, lag.max = 5, plot = TRUE))
```